Inroduction

xy


Prepare workplace

Install libraries

Read data

We import the csv file “loan_sample_9.csv” and make a copy of it to ensure that we don’t mess up the original dataset.

data_loans <- read_csv("loan_sample_9.csv")
## Rows: 40000 Columns: 17
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): grade, home_ownership, verification_status, purpose, application_type
## dbl (12): loan_amnt, int_rate, annual_inc, dti, open_acc, revol_bal, revol_u...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- data_loans

Descriptive analysis

Check the data

In the first step we explore the data. We start by investigating the structure of the data set. There are 12 numeric and 5 categorical variables in the dataset. But the numeric variable “Status” with its values “1” and “0” looks like a factor and all the characteristic variables also look like factors.

## # A tibble: 6 x 17
##   loan_amnt int_rate grade home_ownership annual_inc verification_status purpose
##       <dbl>    <dbl> <chr> <chr>               <dbl> <chr>               <chr>  
## 1      6000     18.2 D     RENT                90000 Not Verified        debt_c~
## 2      8000     13.3 C     MORTGAGE            70000 Verified            home_i~
## 3      6000     14.0 C     MORTGAGE            54000 Source Verified     debt_c~
## 4      1500     15.6 D     RENT                53000 Not Verified        credit~
## 5      7000     10.1 B     RENT                65000 Not Verified        debt_c~
## 6      5000     12.7 C     RENT                37000 Not Verified        debt_c~
## # i 10 more variables: dti <dbl>, open_acc <dbl>, revol_bal <dbl>,
## #   revol_util <dbl>, total_acc <dbl>, total_rec_int <dbl>,
## #   application_type <chr>, tot_cur_bal <dbl>, total_rev_hi_lim <dbl>,
## #   Status <dbl>
## # A tibble: 6 x 17
##   loan_amnt int_rate grade home_ownership annual_inc verification_status purpose
##       <dbl>    <dbl> <chr> <chr>               <dbl> <chr>               <chr>  
## 1      2000     8.18 B     RENT                47000 Source Verified     credit~
## 2      6000    14.5  C     RENT                38000 Source Verified     debt_c~
## 3      2500     9.93 B     OWN                 23000 Not Verified        other  
## 4     16000    19.0  D     RENT                60000 Source Verified     debt_c~
## 5      7000     9.17 B     RENT                34000 Source Verified     small_~
## 6     14400    17.0  D     MORTGAGE           110000 Source Verified     debt_c~
## # i 10 more variables: dti <dbl>, open_acc <dbl>, revol_bal <dbl>,
## #   revol_util <dbl>, total_acc <dbl>, total_rec_int <dbl>,
## #   application_type <chr>, tot_cur_bal <dbl>, total_rev_hi_lim <dbl>,
## #   Status <dbl>
## spc_tbl_ [40,000 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ loan_amnt          : num [1:40000] 6000 8000 6000 1500 7000 ...
##  $ int_rate           : num [1:40000] 18.2 13.3 14 15.6 10.1 ...
##  $ grade              : chr [1:40000] "D" "C" "C" "D" ...
##  $ home_ownership     : chr [1:40000] "RENT" "MORTGAGE" "MORTGAGE" "RENT" ...
##  $ annual_inc         : num [1:40000] 90000 70000 54000 53000 65000 37000 70000 36000 40000 15000 ...
##  $ verification_status: chr [1:40000] "Not Verified" "Verified" "Source Verified" "Not Verified" ...
##  $ purpose            : chr [1:40000] "debt_consolidation" "home_improvement" "debt_consolidation" "credit_card" ...
##  $ dti                : num [1:40000] 25.67 6.72 13.16 16.85 2.36 ...
##  $ open_acc           : num [1:40000] 15 8 9 5 7 6 7 12 8 7 ...
##  $ revol_bal          : num [1:40000] 10839 690 8057 18382 4124 ...
##  $ revol_util         : num [1:40000] 28.7 3.4 42.6 85.1 19.3 36 74.1 22.7 60.1 57.4 ...
##  $ total_acc          : num [1:40000] 28 16 18 18 10 9 7 17 15 10 ...
##  $ total_rec_int      : num [1:40000] 1153 705 1088 338 142 ...
##  $ application_type   : chr [1:40000] "Individual" "Individual" "Individual" "Individual" ...
##  $ tot_cur_bal        : num [1:40000] 90776 199277 148632 23795 4124 ...
##  $ total_rev_hi_lim   : num [1:40000] 37745 20400 18900 21600 21400 ...
##  $ Status             : num [1:40000] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   loan_amnt = col_double(),
##   ..   int_rate = col_double(),
##   ..   grade = col_character(),
##   ..   home_ownership = col_character(),
##   ..   annual_inc = col_double(),
##   ..   verification_status = col_character(),
##   ..   purpose = col_character(),
##   ..   dti = col_double(),
##   ..   open_acc = col_double(),
##   ..   revol_bal = col_double(),
##   ..   revol_util = col_double(),
##   ..   total_acc = col_double(),
##   ..   total_rec_int = col_double(),
##   ..   application_type = col_character(),
##   ..   tot_cur_bal = col_double(),
##   ..   total_rev_hi_lim = col_double(),
##   ..   Status = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data quality issues - Checking for NAs

We check the presence of NAs in each of the variables included in the dataset. There are no NAs values in this dataset.

knitr::kable(apply(data, 2, function(x) any(is.na(x))))
x
loan_amnt FALSE
int_rate FALSE
grade FALSE
home_ownership FALSE
annual_inc FALSE
verification_status FALSE
purpose FALSE
dti FALSE
open_acc FALSE
revol_bal FALSE
revol_util FALSE
total_acc FALSE
total_rec_int FALSE
application_type FALSE
tot_cur_bal FALSE
total_rev_hi_lim FALSE
Status FALSE

What data types are included in the data set?

Now we have 12 numeric and 5 character variables.

overview <- overview(data)
plot(overview)

***

Transform some variables

We transform the characteristic variables in factors to count the categories and order them.

data$grade = as.factor(data$grade)
data$home_ownership = as.factor(data$home_ownership)
data$verification_status = as.factor(data$verification_status)
data$purpose = as.factor(data$purpose)
data$application_type = as.factor(data$application_type)
data$Status = as.factor(data$Status)

data <- data %>%
  select(order(sapply(., is.factor)),order(sapply(., is.numeric)))
overview <- overview(data)
plot(overview)

***

Summary of variables

Nummeric Variables

In most numerical variables there is a large gap between the minimum and maximum. For example, “loan-amnt” (amount of the loan applied for by the borrower) has a minimum of 1,000 and a maximum of 40,000, or “revol_bal” (Total credit revolving balance) from USD 0 to USD 78,762. The average interest rate “int_rate” is around 12.63%, with values between 5.31% and 27.49%. The annual income “annual_inc” of borrowers varies greatly, with an average of around USD 63,277. There are outliers with very high annual salaries. There are borrowers with a dti of 0, which could indicate low indebtedness.

Variable “purpose”

The Variable “purpose” (category provided by the borrower for the loan request) has many categories. They contain the name of the type of loan, except for one group. This group is labeled as “other” and contains 2,283 values. Most loans are used for debt consolidation and credit cards.

Variable “grade”

The most people are graded between “B” and “C”, in the grades “A” or “B” are similar number of people. The variable “grade” assigned loan grade by the financial service provider.

Variable “home_ownership”

The most people are in rent or has a mortgage for there home. 3,982 people are home owner. 14,278 people from 40,000 aren’t verified.

Variable “verification_status”

We see that 14,278 people are not verifide from 40,000 people. 16,129 are source verifide.

Variable “application_type”

Only 530 joined via App from 40,000 people in the System.

Variable “Status”

The target variable “Status” is unbalanced, as there are more loans without default (status 0 = 34,794 persons) than with default (status 1 = 5,206).

summary(data)
##    loan_amnt        int_rate       annual_inc          dti       
##  Min.   : 1000   Min.   : 5.31   Min.   :  6600   Min.   : 0.00  
##  1st Qu.: 7000   1st Qu.: 9.44   1st Qu.: 42000   1st Qu.:12.17  
##  Median :10050   Median :12.29   Median : 57000   Median :17.67  
##  Mean   :11682   Mean   :12.63   Mean   : 63277   Mean   :18.24  
##  3rd Qu.:15125   3rd Qu.:15.05   3rd Qu.: 77000   3rd Qu.:23.89  
##  Max.   :40000   Max.   :27.49   Max.   :400000   Max.   :60.14  
##                                                                  
##     open_acc       revol_bal       revol_util       total_acc    
##  Min.   : 1.00   Min.   :    0   Min.   :  0.00   Min.   : 3.00  
##  1st Qu.: 8.00   1st Qu.: 5619   1st Qu.: 34.80   1st Qu.:15.00  
##  Median :10.00   Median : 9760   Median : 52.50   Median :20.00  
##  Mean   :10.29   Mean   :11948   Mean   : 52.24   Mean   :21.27  
##  3rd Qu.:13.00   3rd Qu.:15792   3rd Qu.: 70.00   3rd Qu.:27.00  
##  Max.   :23.00   Max.   :78762   Max.   :123.20   Max.   :57.00  
##                                                                  
##  total_rec_int     tot_cur_bal     total_rev_hi_lim grade      home_ownership 
##  Min.   :   0.0   Min.   :     0   Min.   :   400   A: 7274   MORTGAGE:17736  
##  1st Qu.: 680.2   1st Qu.: 25136   1st Qu.: 12998   B:13263   OWN     : 3982  
##  Median :1345.5   Median : 53821   Median : 20700   C:11807   RENT    :18282  
##  Mean   :1820.6   Mean   : 99208   Mean   : 24089   D: 7656                   
##  3rd Qu.:2433.9   3rd Qu.:158638   3rd Qu.: 32000                             
##  Max.   :8834.9   Max.   :472573   Max.   :100000                             
##                                                                               
##       verification_status               purpose        application_type
##  Not Verified   :14278    debt_consolidation:23414   Individual:39470  
##  Source Verified:16129    credit_card       : 9362   Joint App :  530  
##  Verified       : 9593    other             : 2283                     
##                           home_improvement  : 2095                     
##                           major_purchase    :  807                     
##                           medical           :  445                     
##                           (Other)           : 1594                     
##  Status   
##  0:34794  
##  1: 5206  
##           
##           
##           
##           
## 

Balance of the target variable

In the next step, we investigate our target variable “Status”. We notice also before in our sample, that we have 5,206 persons which did not default on their loan and we have 34,794 which did default.

As we can see in the visualization the data set is highly imbalanced.

ggplot(data, aes(x = Status, fill = Status)) +
  geom_bar() +
  ylab("Count") +
  xlab("Status of the loan")

PercTable(data$Status)
##                
##     freq   perc
##                
## 0 34'794  87.0%
## 1  5'206  13.0%

In the next step, we carry-out under sampling and visualizate it again.

set.seed(7)
data_original <- data
data_balanced <- ovun.sample(Status ~ ., data=data, method = "under")
data_under <- data.frame(data_balanced[["data"]])


Visualization of the level of the target variable


Distribution of the numeric variables


Cheking for outliers

We provide a boxplot of the numeric variables in both the original and under-sampled dataset.


knitr::kable(diagnose_outlier(data_under), caption = "Diagnose Outlier", digits = 2)
Diagnose Outlier
variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
loan_amnt 237 2.28 32363.92 11919.02 11442.71
int_rate 175 1.68 25.85 13.56 13.35
annual_inc 316 3.04 163143.59 61085.99 57891.00
dti 25 0.24 47.76 18.98 18.91
open_acc 87 0.84 21.37 10.30 10.21
revol_bal 460 4.42 38905.07 11753.72 10498.49
revol_util 0 0.00 NaN 53.36 53.36
total_acc 47 0.45 49.19 20.94 20.82
total_rec_int 591 5.68 6653.93 1868.94 1580.94
tot_cur_bal 386 3.71 363182.06 91491.63 81029.48
total_rev_hi_lim 275 2.64 67840.00 23042.41 21826.89

Visualzation with and without the outliers.

We note that for the variables “annual_inc” (The self-reported annual income provided by the borrower during registration) the visualization changes considerably and there the median also tends to shift strongly.


Dealing with outliers

We do winsorizing for dealing with the highest outliers.

outlier <- function(x){
    quantiles <- quantile(x, c(.05, .95))
    x[x < quantiles[1]] <- quantiles[1]
    x[x > quantiles[2]] <- quantiles[2]
    x
}

data_new_under <- map_df(data_under[,-c(12:17)], outlier)
cols <- data_under[,c(12:17)]
data_new_under <- cbind(data_new_under, cols)
boxplot(scale(data_new_under[,c(1:11)]), use.cols = TRUE)

***





ggpairs(data[, c("loan_amnt", "int_rate", "annual_inc", "dti", "total_acc", "total_rec_int", "tot_cur_bal")], 
        aes(color = as.factor(data$Status)))


plot_ly(data, x = ~loan_amnt, y = ~annual_inc, mode = "markers",
                        type = "scatter", marker = list(color = "#00CED1")) %>%
  layout(title = "Scatter Plot of Loan Amount vs. Annual Income",
         xaxis = list(title = "Loan Amount"),
         yaxis = list(title = "Annual Income"))